humaneval: by examples


Not solved by any model

Six examples are not solved by any model. Solving some of them is a good signal that your model is stronger than the leading models, provided the problems themselves are sound.
HumanEval/129, HumanEval/130, HumanEval/132, HumanEval/145, HumanEval/163, HumanEval/32

Problems solved by 1 model only

example_link    model                      min_elo
HumanEval/93    xwincoder-34b              1194.233
HumanEval/108   claude-3-sonnet-20240229   1100.360
HumanEval/137   Qwen--Qwen1.5-72B-Chat     1074.630
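
The two lists above can be derived from a per-model pass/fail table. The following is a minimal sketch under assumptions: the `results` and `elo` dictionaries, their layout, and the function name are hypothetical and are not taken from the page's own code. It treats `min_elo` for a single-solver problem as the Elo rating of that one model.

```python
def group_by_solver_count(results, elo):
    """Split problems into 'unsolved' and 'solved by exactly one model'.

    results: {model: {problem: bool}}  (hypothetical layout)
    elo:     {model: float}            (hypothetical layout)
    """
    problems = sorted({p for passed in results.values() for p in passed})
    unsolved, single = [], {}
    for p in problems:
        solvers = [m for m, passed in results.items() if passed.get(p, False)]
        if not solvers:
            unsolved.append(p)
        elif len(solvers) == 1:
            # With a single solver, the minimum Elo is that solver's Elo.
            single[p] = (solvers[0], elo[solvers[0]])
    return unsolved, single


# Toy data for illustration only, not the real leaderboard.
results = {
    "model_a": {"P/1": True, "P/2": False, "P/3": False},
    "model_b": {"P/1": True, "P/2": True, "P/3": False},
}
elo = {"model_a": 1000.0, "model_b": 1100.0}
unsolved, single = group_by_solver_count(results, elo)
```

On the toy data, `P/3` lands in `unsolved` and `P/2` is reported with `model_b` as its only solver.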

Suspect problems

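These are the 10 problems whose results correlate least with the overall evaluation. A negative tau means that better models tend to do worse on the problem; a tau near zero means the problem barely separates strong models from weak ones.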

example_link    acc     tau
HumanEval/54    0.163   -0.123
HumanEval/55    0.878   -0.119
HumanEval/126   0.061   -0.055
HumanEval/137   0.020    0.025
HumanEval/47    0.939    0.035
HumanEval/108   0.020    0.042
HumanEval/97    0.796    0.045
HumanEval/122   0.673    0.047
HumanEval/116   0.776    0.062
HumanEval/11    0.857    0.082
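
The page does not say which correlation statistic `tau` is. The column name suggests Kendall's tau, so the sketch below assumes Kendall's tau-b computed between each model's pass/fail outcome on one problem and that model's overall score. Both the choice of tau-b and the choice of "overall score" are assumptions, and the function name is hypothetical.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences (O(n^2), handles ties)."""
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: contributes to neither term
            elif dx == 0:
                ties_x += 1  # tied only in x
            elif dy == 0:
                ties_y += 1  # tied only in y
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom if denom else float("nan")


# Toy data: the two weakest models pass the problem, the two strongest fail it.
passed = [1, 1, 0, 0]
overall = [0.2, 0.3, 0.8, 0.9]
tau = kendall_tau_b(passed, overall)
```

On this toy input the result is negative, which is the "better models do worse" pattern that would flag a problem as suspect. A problem that every model passes or every model fails has no variation, so the sketch returns NaN for it.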

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
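
As a rough sketch of how such a histogram could be built, the code below assumes that a problem's accuracy is the fraction of evaluated models that solve it. The page does not state this definition, although the `acc` values in the table above are consistent with it (0.020 is roughly 1 model out of 49). The data layout and function names are hypothetical.

```python
from collections import Counter


def problem_accuracies(results):
    """Return {problem: fraction of models that pass it}.

    results: {model: {problem: bool}}  (hypothetical layout)
    """
    problems = sorted({p for passed in results.values() for p in passed})
    n_models = len(results)
    return {
        p: sum(bool(r.get(p, False)) for r in results.values()) / n_models
        for p in problems
    }


def histogram(values, n_bins=10):
    """Count values in [0, 1] into equal-width bins; 1.0 goes in the last bin."""
    counts = Counter(min(int(v * n_bins), n_bins - 1) for v in values)
    return [counts.get(b, 0) for b in range(n_bins)]


# Toy data for illustration only.
toy_results = {
    "model_a": {"P/1": True, "P/2": False, "P/3": False},
    "model_b": {"P/1": True, "P/2": True, "P/3": False},
}
acc = problem_accuracies(toy_results)
bins = histogram(acc.values(), n_bins=4)
```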

Histogram of difficulties

Histogram of problems by the minimum Elo to solve each problem.
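
One plausible reading of "minimum Elo to solve" is the lowest Elo rating among the models that solve a given problem. The sketch below implements that reading. It is an assumption, as is the data layout, and the page does not describe how the Elo ratings themselves are computed.

```python
def min_elo_to_solve(results, elo):
    """Return {problem: lowest Elo among its solvers, or None if unsolved}.

    results: {model: {problem: bool}}  (hypothetical layout)
    elo:     {model: float}            (hypothetical layout)
    """
    problems = sorted({p for passed in results.values() for p in passed})
    difficulty = {}
    for p in problems:
        solver_elos = [
            elo[m] for m, passed in results.items() if passed.get(p, False)
        ]
        difficulty[p] = min(solver_elos) if solver_elos else None
    return difficulty


# Toy data for illustration only.
toy = {
    "model_a": {"P/1": True, "P/2": False, "P/3": False},
    "model_b": {"P/1": True, "P/2": True, "P/3": False},
}
toy_elo = {"model_a": 1000.0, "model_b": 1100.0}
difficulty = min_elo_to_solve(toy, toy_elo)
```

Unsolved problems have no minimum, so the sketch marks them as `None`; they would need to be dropped or placed in a separate bucket before plotting.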